OMOP
Common Data Model (CDM)
& Extract-Transform-Load (ETL)
Tutorial
Rimma Belenkaya (Memorial Sloan Kettering)
Karthik Natarajan (Columbia University)
Mark Velez (Columbia University)
Erica Voss (Janssen R&D Epidemiology Analytics)
24 September 2016
Please copy the contents of the
USB drive to your hard disk now.
You will need ~45GB free disk space available.
Introduction
Rimma Belenkaya
Memorial Sloan Kettering
Karthik Natarajan
Columbia University
Mark Velez
Columbia University
Erica Voss
Janssen R&D Epidemiology Analytics
2
Teaching Assistants
Anthony Sena
Janssen R&D Epidemiology Analytics
Jungmi Han
Columbia University
3
Ground Rules
We are recording today’s session, so presenters should repeat
questions.
We may table source specific questions.
The Virtual Machine (VM) distributed today on USB, please
return.
If we cannot get the VM working on your machine let’s try to
buddy you up. Do not worry the presentation will still walk
you through the content.
This course will not focus on the Vocabulary, however the
Vocabulary is critical to the Common Data Model and the ETL
process.
4
Agenda
Time Type Section
8:00AM-8:15AM Introductions
8:15AM-9:15AM
Foundational
What is OMOP/OHDSI?
OMOP Common Data Model (CDM) Why and
How
9:15AM-10:00AM How to retrieve data from OMOP CDM
10:00AM-10:15AM
Break
10:15AM-10:45AM
Implementation
Setup and Performing of an Extract Transform
and Load process into the CDM
10:45AM-11:30AM
Using WhiteRabbit and Rabbit-In-A-Hat to Build
an ETL
11:30AM-11:45AM
Evaluation
Testing and Quality Assurance
11:45AM-12:00PM Wrap up
5
Foundational
What is OMOP/OHDSI?
OMOP Common Data Model
(CDM) – Why and How
Introduction of OMOP/OHDSI
OHDSI: Observational Health Data Sciences and
Informatics is a research collaborative coordinated
through Columbia University
Who?
Multiple stakeholders: academia, government, industry
Multiple disciplines: statistics, epidemiology, informatics,
clinical sciences
Why? To generate evidence about all aspects of healthcare
Where? Multiple geographies: US, Europe, Asia-Pacific, 20
countries. OHDSI collaborators access a network of 600 mln
patients
How? By developing analytical methods and tools based on the
data standardized to OMOP Common Data Model (CDM) and
vocabulary
7
OMOP Common Data Model (CDM)
What is it and why have one?
What?
A standardized way to represent data structure (CDM) and
content (vocabulary)
One model to accommodate data coming from disparate data
sources
administrative claims, electronic health records
EHRs from both inpatient and outpatient settings
registries and longitudinal surveys
data sources both within and outside of US
Why?
Enable standardization of structure and content to support a
systematic and reproducible process to efficiently generate
evidence
Support collaborative research both within and outside of US
8
OMOP CDM v5.0.1
Concept
Concept_relationship
Concept_ancestor
Vocabulary
Source_to_concept_map
Relationship
Concept_synonym
Drug_strength
Cohort_definition
Standardized vocabularies
Attribute_definition
Domain
Concept_class
Cohort
Dose_era
Condition_era
Drug_era
Cohort_attribute
Standardized derived
elements
Standardized clinical data
Drug_exposure
Condition_occurrence
Procedure_occurrence
Visit_occurrence
Measurement
Observation_period
Payer_plan_period
Provider
Care_site Location
Death
Cost
Device_exposure
Observation
Note
Standardized health system data
Fact_relationship
Specimen
CDM_source
Standardized meta-data
Standardized
health economics
Person
9
OMOP CDM Design Principles
Relational design but platform independent
Integrated with Controlled Vocabulary
Domain (subject area) based
Patient centric
Uniformly integrates data from heterogeneous
data sources: EMR, claims, registries
Built for analytical purposes, extended/developed
based on analytic use cases
Extendable, both vocabulary (new vocabs, local
concepts) and CDM (Observation)
10
NYC-CDRN Experience
11
OMOP CDM v5.0.1
Concept
Concept_relationship
Concept_ancestor
Vocabulary
Source_to_concept_map
Relationship
Concept_synonym
Drug_strength
Cohort_definition
Standardized vocabularies
Attribute_definition
Domain
Concept_class
Cohort
Dose_era
Condition_era
Drug_era
Cohort_attribute
Standardized derived
elements
Standardized clinical data
Drug_exposure
Condition_occurrence
Procedure_occurrence
Visit_occurrence
Measurement
Observation_period
Payer_plan_period
Provider
Care_site Location
Death
Cost
Device_exposure
Observation
Note
Standardized health system data
Fact_relationship
Specimen
CDM_source
Standardized meta-data
Standardized
health economics
Person
OMOP Common Vocabulary Model
What it is
Standardized structure to
house existing vocabularies
used in the public domain
Compiled standards from
disparate public and private
sources and some OMOP-
grown concepts
Built on the shoulders of
National Library of Medicine’s
Unified Medical Language
System (UMLS)
What it’s not
Static dataset the vocabulary
updates regularly to keep up
with the continual evolution of
the sources
Finished product – vocabulary
maintenance and
improvement is ongoing
activity that requires
community participation and
support
13
OMOP Common Vocabulary Model
14
Single Concept Reference Table
Vocabulary ID
All vocabularies stacked
up in one table
15
What's in a Concept
For use in CDM
English description
Domain
Vocabulary
Class in SNOMED
Concept in data
Valid during time
interval: always
CONCEPT_ID
313217
CONCEPT_NAME Atrial fibrillation
DOMAIN_ID Condition
VOCABULARY_ID SNOMED
CONCEPT_CLASS_ID Clinical Finding
STANDARD_CONCEPT S
CONCEPT_CODE
49436004
VALID_START_DATE 01-Jan-
70
VALID_END_DATE 31-Dec-
99
INVALID_REASON
Code in SNOMED
16
OMOP Vocabulary Model
Design Principles
Uniform structure
All concepts are in one table
All concept relationships are in one table, including
mappings from source to standard vocabularies
Formalized integration with Common Data Model
via concept domain
Direction of ETL is informed by concept domain
Relationships are bi-directional
Hierarchical relationships have additional
representation in the model to support
efficient data retrieval
17
OMOP CDM Standard Domain Features
18
Integration of CDM and Vocabulary
CONCEPT
concept_id: 44821957
concept_name: Atrial fibrillation’
vocabulary_id: ‘ICD9CM
concept_code: 427.31
primary_domain: condition
standard_concept: N
CONCEPT
concept_id: 312327
concept_name: Atrial fibrillation’
vocabulary_id: ‘SNOMED’
concept_code: 49436004
primary_domain: condition
standard_concept: Y
CONDITION_OCCURRENCE
person_id: 123
condition_concept_id: 312327
condition_start_date: 14Feb2013
condition_source_value: ‘41090’
condition_source_concept_id: 44821957
19
OMOP CDM v5.0.1
Concept
Concept_relationship
Concept_ancestor
Vocabulary
Source_to_concept_map
Relationship
Concept_synonym
Drug_strength
Cohort_definition
Standardized vocabularies
Attribute_definition
Domain
Concept_class
Cohort
Dose_era
Condition_era
Drug_era
Cohort_attribute
Standardized derived
elements
Standardized clinical data
Drug_exposure
Condition_occurrence
Procedure_occurrence
Visit_occurrence
Measurement
Observation_period
Payer_plan_period
Provider
Care_site Location
Death
Cost
Device_exposure
Observation
Note
Standardized health system data
Fact_relationship
Specimen
CDM_source
Standardized meta-data
Standardized
health economics
Person
PERSON
Need to create one unique record
per person (not multiple rows per
move)
Vocabulary for gender, race,
ethnicity: HL7 administrative
No history of
location/demographics: need to
select latest available
Location peculiarity: foreign key to
the LOCATION table that contains
one record per each unique location
Year of birth required…day/month
optional
21
LOCATION
Contains one record per
each unique location
Location is highly
variable across sources,
of limited use thus far
22
Spans of time where data source
has capture of data
Required to run analytical
methods
One person may have multiple
periods if there is interruption in
data capture
Challenge: determine observation
periods based on the source data
OBSERVATION_PERIOD
23
DEATH
Can have death without
cause
Can only have 1 death
per person
24
VISIT_OCCURRENCE
Visits <> ‘Encounters’:
claims often need to be consolidated
to minimize double-counting
inpatient transitions are not covered
Visit Types
Inpatient
Emergency room
Inpatient/Emergency - new
Outpatient
Long-term care
Vocabulary: OMOP
Other attributes: time of visit
start/end, provider, admitting
source, discharge disposition
25
PROCEDURE_OCCURRENCE
Vocabularies: CPT-4,
HCPCS, ICD-9 Procedures,
ICD-10 Procedures, LOINC,
SNOMED
Procedures have the least
standardized vocabularies
that causes some
redundancy
26
CONDITION_OCCURRENCE
Vocabulary: SNOMED ->
classification
Data sources:
Billing diagnosis (inpatient,
outpatient)
Problem list
Individual records <>
distinct episodes
27
DRUG_EXPOSURE
Vocabulary: RxNorm->
classifications by drug class
and indication
Data sources:
Pharmacy dispensing
Prescriptions written
Medication history
Source fields may vary, but so
inference of drug exposure
end may vary
28
DEVICE_EXPOSURE
OMOP CDM is the only
data model supporting
devices
Accommodates FDA
unique device
identifiers (UDI) even
though most data
sources don’t have
them yet
29
MEASUREMENT
EAV design
Vocabulary: LOINC, SNOMED
Data sources: structured,
quantitative measures, such
as laboratory tests
Measures have associated
units
Measurement units
vocabulary: UCUM
No free format for
measurement results
30
OBSERVATION
Catch-all EAV design to
capture all other data:
observation: ‘question’
value: ‘answer’
Can be numeric, concept, or
string (e.g. free text)
Instrument for CDM
extension, playpen
Not all ‘questions’ are
standardized, source value
can accommodate ‘custom’
observations (particularly
pertinent in registries)
31
SPECIMEN
To capture of
biomarker / tissue
bank
32
NOTE
To capture
unstructured free
text
Coming soon in CDM
5.x: NLP and LOINC
Clinical Document
Ontology (CDO)
annotations
33
Health Economics
All costs consolidated
into one table COST
table
Costs tied to
respective
observation records
Domain is
determined by
cost_domain_id (e.g.
visit, condition, etc.)
34
OMOP CDM Service Tables
CDM_SOURCE
Provenance, integration, metadata
Future extension to individual domains
FACT_RELATIONSHIP
Linkage between related observations
Example: systolic and diastolic blood pressure
35
Motivation for Standardized Derived
Elements
Derived elements intended to supplement- not replace- raw data
If derived assumptions don’t meet a specific use case, don’t use them
Promotes transparency and consistency in research by having
standard processes applies across analyses
Increased efficiency by processing key data elements once at ETL-
time, rather than requiring each analysis to figure it out at each
analysis run-time
Key standardized elements available in OMOP CDMv5:
Cohort standardize definition and syntax for defining populations that
meet inclusion criteria
Drug era standardize inference of length of exposure to product for all
active ingredients
Dose era standardize estimation of daily dose for periods of exposure
to all drug products
Condition era standardize aggregation of episodes of care, delineating
between acute vs. chronic conditions
36
Cohort Management
1. COHORT table contains records of subjects that satisfy a given set of criteria for a duration of
time.
2. The definition of the cohort is contained within the COHORT_DEFINITION table. It provides a
standardized structure for maintaining the rules governing the inclusion of a subject into a
cohort, and can store programming code to instantiate the cohort within the OMOP CDM.
3. COHORT_ATTRIBUTE table contains attributes associated with each subject within a cohort, as
defined by a given set of criteria for a duration of time.
4. The definition of the Cohort Attribute is contained in the ATTRIBUTE_DEFINITION table.
37
DRUG_ERA
Standardized
inference of length of
exposure to product
for all active
ingredients
Derived from records
in DRUG_EXPOSURE
under certain rules to
produce continuous
Drug Eras
38
Illustrating inferences needed within longitudinal
pharmacy claims data for one patient
Person Timeline
NDC: 00179198801
Lisinopril 5 MG Oral Tablet
Prescription dispensing
(Fill date + days supply)
NDC: 00310013010
ZESTRIL 5 MG
TABLET
NDC: 00038013134
Lisinopril 10 MG Oral Tablet
[Zestril]
NDC: 00038013210
Lisinopril 20 MG Oral Tablet
[Zestril]
NDC: 58016078020
Hydrochlorothiazide 12.5 MG /
Lisinopril 20 MG Oral Tablet
[Zestoretic]
60d
30d
30d
gap
How do we
handle overlap?
How do we
handle gaps?
How do we handle
change in dose?
How do we handle
combination products?
Lisinopril era 1 Era 2
How do we handle
NDC change?
How do we infer
discontinuation?
X
How do we handle
reversals?
What makes OMOP CDM unique
Specialized CDM - reflective of clinical domain,
granular, well structured
Vocabulary - uniformly structured and well curated
Information Model - formalized connection between
data model and conceptual model (Vocabulary)
Specialized yet Extendable new attributes and
concepts can be added
Supportive Community of developers and researchers
Development driven by analytic use cases
Foundational
How to retrieve data from
OMOP CDM
OHDSI in a Box
WebAPI
Atlas
Penelope
Calypso
synpuf_100k
WhiteRabbit
RabbitInAHat
cdm webapi
pgAdmin
Broadsea
OHDSI R
packages
Studio
Methods Library
WebTools
Tomcat
42
OHDSI in a Box – Setup
1. Open VM VirtualBox Manager
2. Click on
1.
43
OHDSI in a Box Start Up
44
OHDSI in a Box International Keyboards
45
OHDSI in a Box Adjust Resolution
46
OHDSI in a Box – Clipboard
47
OHDSI in a Box Timeout
48
OHDSI in a Box Ready
49
CDM Database pgAdmin III New
Server
50
CDM Database Connect
51
CDM Database Open SQL Sheet
52
CDM Database Ready
53
Data Used for Demonstration
Medicare Claims Synthetic Public Use Files
(SynPUFs)
synthetic US Medicare insurance claims database
Medicare is a government based insurance
program for primarily 65 and older but also
individuals with disabilities
SynPUF not for research but rather
demonstration/development purposes
Has been converted to the Common Data Model
https://www.cms.gov/research-statistics-data-and-systems/downloadable-public-use-files/synpufs/
54
Data Used for Demonstration
Five types of data:
DE-SynPUF
Unit of
record
Number of
Records
2008
Number of
Records
2009
Number of
Records
2010
Beneficiary Summary
Benefi-
ciary
2,326,856 2,291,320 2,255,098
Inpatient Claims claim 547,800 504,941 280,081
Outpatient Claims claim 5,673,808 6,519,340 3,633,839
Carrier Claims claim 34,276,324 37,304,993 23,282,135
Prescription Drug
Events (PDE)
event 39,927,827 43,379,293 27,778,849
https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html
55
SynPUF High Level Diagram
SYNPUF DIAGRAM
Beneficiary
Summary
Prescription Drug Events
(PDE)
Inpatient Claims
Carrier Claims
Outpatient Claims
56
SYNPUF DIAGRAM
Beneficiary
Summary
Prescription Drug
Events (PDE)
Inpatient Claims
Carrier Claims
Outpatient Claims
Cohort
Dose_era
Condition_era
Drug_era
Cohort_attribute
Standardized derived
elements
Standardized clinical data
Drug_exposure
Condition_occurrence
Procedure_
occurrence
Visit_occurrence
Measurement
Payer_plan_period
Provider
Care_site Location
Death
Cost
Device_exposure
Observation
Observation_period
Note
Standardized health system data
Fact_relationship
Specimen
Standardized health
economics
Person
Mapping SynPUF to CDM
SynPUF
CDM
57
Some Example Questions
New Users of Warfarin
New Users of Warfarin
who are >=65?
New Users of Warfarin
with prior Atrial Fibrillation?
Ex 1
Ex 2
Ex 3
58
Warfarin is a blood thinner that is used to
treat/prevent blood clots.
Where do you find drug data in the CDM?
What codes do I use to define drugs?
What does “New User” mean?
New Users of Warfarin
Ex 1
59
Concept
Concept_relationship
Concept_ancestor
Vocabulary
Source_to_concept_map
Relationship
Concept_synonym
Drug_strength
Cohort_definition
Standardized vocabularies
Attribute_definition
Domain
Concept_class
Cohort
Dose_era
Condition_era
Drug_era
Cohort_attribute
Standardized derived
elements
Standardized clinical data
Drug_exposure
Condition_occurrence
Procedure_occurrence
Visit_occurrence
Measurement
Observation_period
Payer_plan_period
Provider
Care_site Location
Death
Cost
Device_exposure
Observation
Note
Standardized health system data
Fact_relationship
Specimen
CDM_source
Standardized meta-data
Standardized
health economics
Person
captures records about the utilization of
a drug when ingested or otherwise
introduced into the body
Where are Drug Exposures in the
CDM?
60
How do I define Warfarin?
When raw data is transformed into the CDM raw
source codes are transformed into standard
OMOP Vocabulary concepts
In the CDM, we no longer care what source
concepts existed in the raw data, we just need to
use concept identifiers
We can use the OMOP Vocabulary to identify all
concepts that contain the ingredient warfarin
Ex 1
61
How do I define Warfarin?
OHDSI Tool ATLAS
Ex 1
Writing SQL Statement
SQL
62
someone who has recently started taking the
drug, typically with a 6 or 12 month wash out
How do I define new users of a drug?
2007 2008 2009 2010 2011 2012 2013 2014 2015
Ex 1
63
How do I define new users of a drug?
time in
database
index
drug
6 months
someone who has recently started taking the
drug, typically with a 6 or 12 month wash out
Ex 1
64
What is Needed in the CDM?
OMOP Vocabulary
to find the concepts
DRUG_EXPOSURE
to find individuals with exposure
OBSERVATION_PERIOD
to know people’s time within the database
Ex 1
65
New Users of Warfarin
Ex 1
66
Step 1: Get the codes you need
Ex 1
67
Step 2: Find Drug Exposures
Ex 1
68
Step 3: Find New Users
Ex 1
69
Try running this on your own!
New Users of Warfarin
How many people do you get?
18,080 individuals
Ex 1
70
someone who has recently started taking the
drug, typically with a 6 or 12 month wash out
How do I define new users of
warfarin who are >=65?
time in
database
index
drug
6 months
>=65
years old
Ex 2
71
What is Needed in the CDM?
OMOP Vocabulary
to find the concepts
DRUG_EXPOSURE
to find individuals with exposure
OBSERVATION_PERIOD
to know people’s time within the database
PERSON
to know year of birth
Ex 2
72
Step 1: Start with the previous
query
Ex 2
73
Step 2: Add the Person Table to
calculate age
Ex 2
74
Try running this on your own!
How many people do you get?
14,946 individuals
New Users of Warfarin
<= 65 years of age
Ex 2
75
How do I define new users of Warfarin
with prior Atrial Fibrillation?
time in
database
index
drug
6 months
prior AFIB
Ex 3
76
What is Needed in the CDM?
OMOP Vocabulary
to find the concepts
DRUG_EXPOSURE
to find individuals with exposure
OBSERVATION_PERIOD
to know people’s time within the database
CONDITION_OCCURRENCE
to find presence of a disease
Ex 3
77
Step 1: Start with the Ex 1 query
Ex 3
78
Step 2: Define Atrial Fibrillation
Ex 3
79
Step 3: Prior Atrial Fibrillation
Keeps condition within the
same observable time,
exclude if you want all time
prior
Ex 3
80
How do I define new users of Warfarin
with prior Atrial Fibrillation?
time in
database
index
drug
6 months
prior AFIB
observation
time
observation
time
Ex 3
81
Try running this on your own!
How many people do you get?
10,005 individuals
New Users of Warfarin
with prior Atrial Fibrillation
Ex 3
82
Try on your own!
8,207 individuals
3,148 individuals
Warfarin New Users 65 or Older at Index with
Prior Atrial Fibrillation
Bonus: Clipidogrel New Users 65 or Older at
Index with Prior Atrial Fibrillation
8,207 individuals
3,148 individuals
83
Queries Can Be Automated
Open up Google Chrome
Navigate to:
http://localhost:8080/atlas/#/home
In Atlas navigate to Cohorts
There should be a pre-existing cohort called
“Warfarin New Users 65 or Older at Index
with Prior Atrial Fibrillation.”
84
Queries Can Be Automated
85
Break
Please return in 15 minutes
86
Implementation
Setup and Performing of an
Extract Transform and Load
process into the CDM
Brief Review
Foundational
OHDSI - Why and how
OMOP CDM - Standardizing structure & content
Real-world examples (SQL and Atlas)
88
How do we create our own OMOP
CDM instance?
Extract Load
cdm
source
1
source
2
source
3
Transform
89
ETL: Real world scenario
Truven MarketScan Commercial Claims and Encounters (CCAE)
INPATIENT_SERVICES
enrolid
admdate
pdx dx1
dx2
dx3
1570337021
5/31/2000
41071 41071
4241
V5881
Optum Extended SES
MEDICAL_CLAIMS
patid
fst_dt
diag1 diag2
diag3
diag4
259000476532
5/30/2000
41071 27800
4019
2724
Premier
PATICD_DIAG
pat_key
period
icd_code icd_pri_sec
-
17197140
1/1/2000
410.71 P
-
17197140
1/1/2000
414.01 S
-
17197140
1/1/2000
427.31 S
-
17197140
1/1/2000
496 S
Japan Medical Data Center
DIAGNOSIS
member_id
admission_date
icd10_level4_code
M0041437
4/11/2013
I214
M0041437
4/11/2013
A539
M0041437
4/11/2013
B182
M0041437
4/11/2013
E14-
4 real observational databases, all containing
an inpatient admission for a patient with a
diagnosis of ‘acute subendocardial infarction’
Not a single table name the same…
Not a single variable name the same….
Different table structures (rows vs.
columns)
Different conventions (with and without
decimal points)
Different coding schemes (ICD9 vs. ICD10)
What does it mean to ETL to OMOP CDM?
Standardize structure and content
Truven MarketScan Commerical Claims and Encounters (CCAE)
INPATIENT_SERVICES
enrolid
admdate
pdx dx1
dx2
dx3
1570337021
5/31/2000
41071 41071
4241
V5881
Truven MarketScan Commerical Claims and Encounters (CCAE)
CONDITION_OCCURRENCE
CONDITION_
START_DATE
CONDITION_
SOURCE_VA
LUE
CONDITION_TYPE_CONCEPT_ID
157033702
5/31/2000
41071
Inpatient claims - primary position
157033702
5/31/2000
41071
Inpatient claims - 1st position
157033702
5/31/2000
4241
Inpatient claims - 2nd position
157033702
5/31/2000
V5881
Inpatient claims - 3rd position
Structure optimized for large-scale analysis for clinical
characterization, population-level estimation, and patient-
level prediction
Content using international vocabulary standards that can be
applied to any data source
Truven MarketScan Commerical Claims and Encounters (CCAE)
CONDITION_OCCURRENCE
PERSON_ID
CONDITION
_START
_DATE
CONDITION
_SOURCE
_VALUE
CONDITION _TYPE
_CONCEPT_ID
CONDITION
_SOURCE
_CONCEPT_ID
CONDITION
_CONCEPT_ID
157033702
5/31/2000
41071
Inpatient claims
-
primary position
44825429
444406
OMOP CDM = Standardized structure:
same tables, same fields, same datatypes,
same conventions across disparate sources
Truven CCAE: CONDITION_OCCURRENCE
PERSON_ID
CONDITION_
START_DATE
CONDITION
_SOURCE_V
ALUE
CONDITION_TYPE_CONCEPT_ID
157033702
5/31/2000
41071
Inpatient claims
- primary
position
157033702
5/31/2000
41071
Inpatient claims
- 1st position
157033702
5/31/2000
4241
Inpatient claims
- 2nd position
157033702
5/31/2000
V5881
Inpatient claims
- 3rd position
Optum Extended SES: CONDITION_OCCURRENCE
PERSON_ID
CONDITION_
START_DATE
CONDITION
_SOURCE_V
ALUE
CONDITION_TYPE_CONCEPT_ID
259000474406532
5/30/2000
41071
Inpatient claims
- 1st position
259000474406532
5/30/2000
27800
Inpatient claims
- 2nd position
259000474406532
5/30/2000
4019
Inpatient claims
- 3rd position
259000474406532
5/30/2000
2724
Inpatient claims
- 4th position
Premier : CONDITION_OCCURRENCE
PERSON_ID
CONDITION_
START_DATE
CONDITION
_SOURCE_V
ALUE
CONDITION_TYPE_CONCEPT_ID
-
171971409
1/1/2000
410.71
Hospital record
- primary
-
171971409
1/1/2000
414.01
Hospital record
- secondary
-
171971409
1/1/2000
427.31
Hospital record
- secondary
-
171971409
1/1/2000
496
Hospital record
- secondary
JMDC :
CONDITION_OCCURRENCE
PERSON_ID
CONDITION_
START_DATE
CONDITION
_SOURCE_V
ALUE
CONDITION_TYPE_CONCEPT_ID
4149337
4/11/2013
I214
Inpatient claims
4149337
4/11/2013
A539
Inpatient claims
4149337
4/11/2013
B182
Inpatient claims
4149337
4/11/2013
E14
-
Inpatient claims
Consistent structure optimized for large-
scale analysis
Structure preserves all source content and
provenance
OMOP CDM = Standardized content:
common vocabularies across disparate
sources
Truven CCAE: CONDITION_OCCURRENCE
PERSON_ID
CONDITION
_START
_DATE
CONDITION
_SOURCE
_VALUE
CONDITION _TYPE
_CONCEPT_ID
CONDITION
_SOURCE
_CONCEPT_ID
CONDITION
_CONCEPT_ID
157033702
5/31/2000
41071
Inpatient claims
-
primary position
44825429
444406
Optum Extended SES: CONDITION_OCCURRENCE
PERSON_ID
CONDITION
_START
_DATE
CONDITION
_SOURCE
_VALUE
CONDITION _TYPE
_CONCEPT_ID
CONDITION
_SOURCE
_CONCEPT_ID
CONDITION
_CONCEPT_ID
259000474406532
5/30/2000
41071
Inpatient claims
- 1st
position
44825429
444406
Premier : CONDITION_OCCURRENCE
PERSON_ID
CONDITION
_START
_DATE
CONDITION
_SOURCE
_VALUE
CONDITION _TYPE
_CONCEPT_ID
CONDITION
_SOURCE
_CONCEPT_ID
CONDITION
_CONCEPT_ID
-
171971409
1/1/2000
410.71
Hospital record
-
primary
44825429
444406
JMDC : CONDITION_OCCURRENCE
PERSON_ID
CONDITION
_START
_DATE
CONDITION
_SOURCE
_VALUE
CONDITION _TYPE
_CONCEPT_ID
CONDITION
_SOURCE
_CONCEPT_ID
CONDITION
_CONCEPT_ID
4149337
4/11/2013
I214
Inpatient claims
45572081
444406
Standardize source
codes to be uniquely
defined across all
vocabularies
No more worries
about formatting or
code overlap
Standardize across
vocabularies to a
common referent
standard
(ICD9/10→SNOMED)
Source codes mapped
into each domain
standard so that now
you can talk across
different languages
ETL Process: Roles
Members of the team
CDM expert
Local data expert
Data engineer
Person with medical knowledge
Business stakeholder
94
ETL Process: Agile
During Sprint
Post Sprint
ETL Specs Develop
Execute Evaluation
Business
Validation/Sign-
off
Issues/Corrections
Setup
Sprint 0
Sprint 1
Sprint 2
Sprint 3
Sprint 4
Release
95
Example OHDSI ETL Process
Sprint 0
Location
Care site
Person
Provider
Condition
Death
Organization
Sprint 1
Procedure
Occurrence
Observation
Payer plan
period
Drug Cost
Procedure
Cost
Sprint 2
Drug Exposure
Sprint 3
Drug Era
Condition Era
Observation
Period
Visit
Occurrence
Sprint 4
Finalize ETL
Specs
Sprint 0
Initial Data
On-boarding
Sprint 1
Location
Care site
Person
Provider
Condition
Death
Organization
Sprint 2
Procedure
Occurrence
Observation
Payer plan
period
Drug Cost
Procedure
Cost
Sprint 3
Drug Exposure
Sprint 4
Drug Era
Condition Era
Observation
Period
Visit
Occurrence
Analysis – Creation of ETL Specs/Stories
Development Implementation/Validation of ETL Specs
For each table:
Backlog
White Rabbit
Vocabulary Mapping
ETL specs
96
OHDSI Resources for ETL
During Sprint
Post Sprint
ETL Specs Develop
Execute Evaluation
Business
Validation/Sign-off
Issues/Corrections
Rabbits Usagi
Atlas Data Sources
(Achilles)
97
Best Practices Documented
http://www.ohdsi.org/web/wiki/doku.php?id=d
ocumentation:etl_best_practices
98
Implementation
Using WhiteRabbit and
Rabbit-In-A-Hat to Build an
ETL
Getting WhiteRabbit
https://github.com/OHDSI/WhiteRabbit
Click on “releases”
Find the
“Latest Release”
and download the
WhiteRabbit zip file
100
Getting WhiteRabbit
Save the ZIP file somewhere and extract the
files
Double-click on the WhiteRabbit.jar to start
the application.
101
Working with WhiteRabbit
Wiki:
http://www.ohdsi.org/web/wiki/doku.php?id
=documentation:software:whiterabbit
1. Specify the location of your data
Supports database connections as well as text
files
2. Scanning your Database
Characterizes your data
102
Specify the Location of Data
103
Specify the Location of Data
104
Scanning your Data
105
Scanning your Data
106
Scanning your Data
107
Link on desktop
Execute
WhiteRabbit appears
Run the Scan Report
on Your Data!
108
Set the “Working Folder” to
/home/ohdsi/whiterabbit/SynPUFSmall
Press “Test connection
Move over to the “Scan” tab, and hit the “Add” button.
Select the CSVs in the folder.
Keep the default settings and press “Scan tables”.
Scan report is created in the folder you specified on the
“Locations” tab as “ScanReport.xlsx”.
Run the Scan Report
on Your Data!
109
Reading the Scan
Series of tabs in an XLSX file
Overview Tab
provides the definition of each table analyzed,
there will only be one tab of this type
Table Tab(s)
a summary column for every column, there will
be as many tabs as tables selected to analyze
110
Overview Tab
defines the tables you scanned
111
Table Tabs
Definition of the Beneficiary_Summary.csv table and each
record pertains to a synthetic medicare beneficiary
Beneficiary_Summary.csv
112
Read the Scan Report
Open up the scan report generated
Go to the “Inpatient_Claims.csv” tab which represents claims
processed from inpatient setting.
What is the most common admitting diagnosis code?
Hints:
ADMTNG_ICD9_DGNS_CD
ICD9 codes are in ###.## format
You can use ATLAS to look it up
786.05 - Shortness of breath
113
Rabbit In a Hat
Already part of the WhiteRabbit download
Uses the information from WhiteRabbit to
help you produce documentation for the ETL
process
Helps you define the logic in a consistent way
does not generate code to create ETL
114
Getting Started
Double-click on the RabbitInAHat.jar to start
the application.
File → Open Scan Report and navigate to the
scan report that was just created.
115
Process for Developing ETL
Get the right people in the room
Block off time necessary
Map all the raw data tables to CDM tables
Then go back through and provide detailed mapping
information for each raw data table to CDM table
Generate final ETL document
116
Map Raw Tables
to CDM Tables
117
Map Raw Tables
to CDM Tables
118
Map Raw Tables
to CDM Tables
Prescription Drug Events contain
records pertaining to Part D synthetic
events. Medicare Part D are
prescription drug benefits.
119
Map Raw Tables
to CDM Tables
Synpuf contains drug codes like NDC
52555011101 - Clofibrate 500 MG Oral
Capsule” which maps to the standard
concept 1598659 - Clofibrate 500 MG
Oral Capsule”
120
Map Raw Tables
to CDM Tables
Cannot forget Carrier Claims, Inpatient
Claims, and Outpatient Claims as well.
They contain procedure codes like
HCPCs that may convert into drugs as
well.
121
Map Raw Tables
to CDM Tables
HCPC (a procedure code) J9310 -
“Injection, rituximab, 100 mg” maps to
standard concept 46275076 - “rituximab
Injection” which belongs to the drug
domain, not the procedure domain.
122
Map Raw Tables
to CDM Tables
Continue mapping raw
tables to CDM tables
until you feel
confident you are
bringing over as
much raw data as
necessary
123
PERSON
For today’s example we’ll start with the
PERSON table
124
125
126
DRUG_EXPOSURE
Try drawing arrows from PRESCRIPTION_DRUG_EVENTS
columns to DRUG_EXPOSURE columns
Focus on:
PERSON_ID
DRUG_EXPOSURE_START_DATE
QUANTITY
DAYS_SUPPLY
DRUG_SOURCE_VALUE
127
DRUG_EXPOSURE
128
DRUG_EXPOSURE
Mapping source codes to standard
terminology
Source to Source
Source to Standard
Use standard query for both, just define
filters needed
Standard
Query
J9310 - “Injection,
rituximab, 100 mg”
46275076 -
“rituximab Injection”
129
Standard Query:
Source to Standard
WITH CTE_VOCAB_MAP AS (
SELECT c.concept_code AS SOURCE_CODE, c.concept_id AS SOURCE_CONCEPT_ID, c.concept_name AS
SOURCE_CODE_DESCRIPTION, c.vocabulary_id AS SOURCE_VOCABULARY_ID, c.domain_id AS
SOURCE_DOMAIN_ID, c.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID, c.VALID_START_DATE AS
SOURCE_VALID_START_DATE, c.VALID_END_DATE AS SOURCE_VALID_END_DATE, c.INVALID_REASON AS
SOURCE_INVALID_REASON,c1.concept_id AS TARGET_CONCEPT_ID, c1.concept_name AS
TARGET_CONCEPT_NAME, c1.VOCABULARY_ID AS TARGET_VOCABUALRY_ID, c1.domain_id AS
TARGET_DOMAIN_ID, c1.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c1.INVALID_REASON AS
TARGET_INVALID_REASON, c1.standard_concept AS TARGET_STANDARD_CONCEPT
FROM CONCEPT C
JOIN CONCEPT_RELATIONSHIP CR ON C.CONCEPT_ID = CR.CONCEPT_ID_1 AND CR.invalid_reason IS NULL
AND cr.relationship_id = 'Maps to'
JOIN CONCEPT C1 ON CR.CONCEPT_ID_2 = C1.CONCEPT_ID AND C1.INVALID_REASON IS NULL
UNION
SELECT source_code, SOURCE_CONCEPT_ID, SOURCE_CODE_DESCRIPTION, source_vocabulary_id, c1.domain_id
AS SOURCE_DOMAIN_ID, c2.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID, c1.VALID_START_DATE AS
SOURCE_VALID_START_DATE, c1.VALID_END_DATE AS SOURCE_VALID_END_DATE, stcm.INVALID_REASON AS
SOURCE_INVALID_REASON,target_concept_id, c2.CONCEPT_NAME AS TARGET_CONCEPT_NAME,
target_vocabulary_id, c2.domain_id AS TARGET_DOMAIN_ID, c2.concept_class_id AS TARGET_CONCEPT_CLASS_ID,
c2.INVALID_REASON AS TARGET_INVALID_REASON, c2.standard_concept AS TARGET_STANDARD_CONCEPT
FROM source_to_concept_map stcm
LEFT OUTER JOIN CONCEPT c1 ON c1.concept_id = stcm.source_concept_id
LEFT OUTER JOIN CONCEPT c2 ON c2.CONCEPT_ID = stcm.target_concept_id
WHERE stcm.INVALID_REASON IS NULL
)
SELECT *
FROM CTE_VOCAB_MAP
/*EXAMPLE FILTERS*/
WHERE SOURCE_VOCABULARY_ID IN ('NDC')
AND TARGET_STANDARD_CONCEPT IN ('S')
130
Standard Query:
Source to Source
WITH CTE_VOCAB_MAP AS (
SELECT c.concept_code AS SOURCE_CODE, c.concept_id AS SOURCE_CONCEPT_ID, c.CONCEPT_NAME AS
SOURCE_CODE_DESCRIPTION, c.vocabulary_id AS SOURCE_VOCABULARY_ID, c.domain_id AS
SOURCE_DOMAIN_ID, c.concept_class_id AS SOURCE_CONCEPT_CLASS_ID, c.VALID_START_DATE AS
SOURCE_VALID_START_DATE, c.VALID_END_DATE AS SOURCE_VALID_END_DATE, c.invalid_reason AS
SOURCE_INVALID_REASON, c.concept_ID as TARGET_CONCEPT_ID, c.concept_name AS
TARGET_CONCEPT_NAME, c.vocabulary_id AS TARGET_VOCABULARY_ID, c.domain_id AS TARGET_DOMAIN_ID,
c.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c.INVALID_REASON AS
TARGET_INVALID_REASON,c.STANDARD_CONCEPT AS TARGET_STANDARD_CONCEPT
FROM CONCEPT c
UNION
SELECT source_code, SOURCE_CONCEPT_ID, SOURCE_CODE_DESCRIPTION, source_vocabulary_id, c1.domain_id
AS SOURCE_DOMAIN_ID, c2.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID, c1.VALID_START_DATE AS
SOURCE_VALID_START_DATE, c1.VALID_END_DATE AS SOURCE_VALID_END_DATE,stcm.INVALID_REASON AS
SOURCE_INVALID_REASON,target_concept_id, c2.CONCEPT_NAME AS TARGET_CONCEPT_NAME,
target_vocabulary_id, c2.domain_id AS TARGET_DOMAIN_ID, c2.concept_class_id AS TARGET_CONCEPT_CLASS_ID,
c2.INVALID_REASON AS TARGET_INVALID_REASON, c2.standard_concept AS TARGET_STANDARD_CONCEPT
FROM source_to_concept_map stcm
LEFT OUTER JOIN CONCEPT c1 ON c1.concept_id = stcm.source_concept_id
LEFT OUTER JOIN CONCEPT c2 ON c2.CONCEPT_ID = stcm.target_concept_id
WHERE stcm.INVALID_REASON IS NULL
)
SELECT *
FROM CTE_VOCAB_MAP
/*EXAMPLE FILTERS*/
WHERE SOURCE_VOCABULARY_ID IN ('ICD9CM')
AND TARGET_VOCABULARY_ID IN ('ICD9CM')
131
Example Filters: NDCs
Source to Standard
Source to Source
Some maps are date sensitive like NDC
or DRGs
Review for incorrect mappings (e.g. source codes
might map to multiple SOURCE_VOCAB_IDs)
132
Saving and
Export to Document
Save working document
Export to document
133
Evaluation
Testing and Quality
Assurance
ACHILLES
Interactive platform to visualize data in CDM
patient demographics
prevalence of conditions, drugs and procedures
distribution of values for clinical observations
https://github.com/OHDSI/Achilles
135
136
137
138
139
140
141
ETL Pitfalls
Privacy Issues
Removal of ICD9/10 codes that are considered
privacy issues, such as death or sexual abuse
Using “fake” date in Death table to indicate a death
Patient Cleansing
Test patients
Differing Business Rules
Institutions decide not to follow vocabulary
classifications
142
Conclusion
CDM standardizes the structure
CDM is a patient centric model
ETL Process is an Iterative One
White Rabbit your data to
understand it
Plan your ETL in Rabbit in a Hat
OMOP Vocabulary used to
standardize the terminology
OHDSI is an Open Source
Collaborative Community
ACHILLES Helps you evaluate
your data
1
2
3
4
5
6
7
8
143
Join the journey
Interested in OHDSI?
Join the Journey:
http://www.ohdsi.org/join-the-journey/
Questions:
http://forums.ohdsi.org/
144
APPENDIX
USAGI
Tool to help in mapping codes from a source
system into the standard terminologies stored
in OMOP Vocabulary
http://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:usagi
146
USAGI Exercise
Terminal Window
Sample File
cd usagi
java -jar Usagi_v0.3.3.jar
147
USAGI Exercise
Need to tell usagi
which column is what
148
USAGI Exercise
149
USAGI Exercise
150
USAGI Exercise
151
USAGI Exercise
152